This analysis focuses on the churn data for a telecommunications company. By understanding the dataset’s structure, patterns, and possible correlations, we aim to derive insights that can predict customer churn behavior and guide future retention strategies.
Let’s load the dataset and take a preliminary look at its first few rows.
churn <- read.csv("../data/WA_Fn-UseC_-Telco-Customer-Churn.csv")
DT::datatable(
head(churn),
options = list(scrollX = TRUE)
)
Understanding the structure and some basic statistics can give us an overview of the data’s nature.
str(churn)
'data.frame': 7043 obs. of 21 variables:
$ customerID : chr "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
$ gender : chr "Female" "Male" "Male" "Male" ...
$ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
$ Partner : chr "Yes" "No" "No" "No" ...
$ Dependents : chr "No" "No" "No" "No" ...
$ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
$ PhoneService : chr "No" "Yes" "Yes" "No" ...
$ MultipleLines : chr "No phone service" "No" "No" "No phone service" ...
$ InternetService : chr "DSL" "DSL" "DSL" "DSL" ...
$ OnlineSecurity : chr "No" "Yes" "Yes" "Yes" ...
$ OnlineBackup : chr "Yes" "No" "Yes" "No" ...
$ DeviceProtection: chr "No" "Yes" "No" "Yes" ...
$ TechSupport : chr "No" "No" "No" "Yes" ...
$ StreamingTV : chr "No" "No" "No" "No" ...
$ StreamingMovies : chr "No" "No" "No" "No" ...
$ Contract : chr "Month-to-month" "One year" "Month-to-month" "One year" ...
$ PaperlessBilling: chr "Yes" "No" "Yes" "No" ...
$ PaymentMethod : chr "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
$ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
$ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
$ Churn : chr "No" "No" "Yes" "No" ...
knitr::kable(summary(churn))
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Length:7043 | Length:7043 | Min. :0.0000 | Length:7043 | Length:7043 | Min. : 0.00 | Length:7043 | Length:7043 | Length:7043 | Length:7043 | Length:7043 | Length:7043 | Length:7043 | Length:7043 | Length:7043 | Length:7043 | Length:7043 | Length:7043 | Min. : 18.25 | Min. : 18.8 | Length:7043 | |
| Class :character | Class :character | 1st Qu.:0.0000 | Class :character | Class :character | 1st Qu.: 9.00 | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | 1st Qu.: 35.50 | 1st Qu.: 401.4 | Class :character | |
| Mode :character | Mode :character | Median :0.0000 | Mode :character | Mode :character | Median :29.00 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Median : 70.35 | Median :1397.5 | Mode :character | |
| NA | NA | Mean :0.1621 | NA | NA | Mean :32.37 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Mean : 64.76 | Mean :2283.3 | NA | |
| NA | NA | 3rd Qu.:0.0000 | NA | NA | 3rd Qu.:55.00 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 3rd Qu.: 89.85 | 3rd Qu.:3794.7 | NA | |
| NA | NA | Max. :1.0000 | NA | NA | Max. :72.00 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Max. :118.75 | Max. :8684.8 | NA | |
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA’s :11 | NA |
We’ve identified 11 missing values in the TotalCharges column. Upon closer examination, these entries correspond to customers with less than one month of tenure, which indicates that charges are recorded only at the end of a billing cycle or month. For customers in their initial month, the logical approach is to replace the missing TotalCharges values with the corresponding values from the MonthlyCharges column, since for these customers the total charges accrued up to that point equal their single month’s charge.
churn$TotalCharges[is.na(churn$TotalCharges)] <- churn$MonthlyCharges[is.na(churn$TotalCharges)]
par(mfrow=c(1,2))
boxplot(churn$MonthlyCharges, main="Boxplot - Monthly Charges", col="#99CC99")
boxplot(churn$TotalCharges, main="Boxplot - Total Charges", col="#9999CC")
Upon visual inspection, there do not appear to be any conspicuous outliers in these columns. For completeness, let us use the interquartile range (IQR) as a measure for detecting outliers.
detect_outliers <- function(data, column_name) {
# Check if column_name exists in data
if(!(column_name %in% names(data))) {
stop(paste("The column", column_name, "does not exist in the provided data frame."))
}
# Extract column values
column_values <- data[[column_name]]
# Calculate IQR for the column
IQR_value <- IQR(column_values)
# Calculate lower and upper bounds
lower_bound <- quantile(column_values, 0.25) - 1.5 * IQR_value
upper_bound <- quantile(column_values, 0.75) + 1.5 * IQR_value
# Detect outliers
outliers <- column_values[column_values < lower_bound | column_values > upper_bound]
return(outliers)
}
MonthlyChargesOutliers <- detect_outliers(churn, "MonthlyCharges")
TotalChargesOutliers <- detect_outliers(churn, "TotalCharges")
print(MonthlyChargesOutliers)
numeric(0)
print(TotalChargesOutliers)
numeric(0)
Given our current analysis, there are no glaring outliers in the dataset. This is excellent as it simplifies the preprocessing stage and assures us that our subsequent analyses won’t be heavily skewed by extreme values.
tenure_plot <- ggplot(churn, aes(x=tenure)) +
geom_histogram(binwidth=5, fill="#66c2a5", color="white", alpha=0.7) +
labs(title="Distribution of Tenure", x="Tenure", y="Count") +
theme_minimal()
print(tenure_plot)
Insight: A significant number of customers discontinue their service within the initial months. The peak observed at the highest tenure values should be interpreted as customers who are currently active and have not yet terminated their relationship with the telecom provider.
monthlyCharges_plot <- ggplot(churn, aes(x=MonthlyCharges)) +
geom_histogram(binwidth=10, fill="#fc8d62", color="white", alpha=0.7) +
labs(title="Distribution of Monthly Charges", x="Monthly Charges", y="Count") +
theme_minimal()
print(monthlyCharges_plot)
Insight: This heavy-tailed distribution resembles the log-Cauchy distribution, which is generally used to model processes where significant outliers or extreme results may occur.
totalCharges_plot <- ggplot(churn, aes(x=TotalCharges)) +
geom_histogram(binwidth=100, fill="#8da0cb", color="white", alpha=0.7) +
labs(title="Distribution of Total Charges", x="Total Charges", y="Count") +
theme_minimal()
print(totalCharges_plot)
Insight: The distribution of total charges appears roughly exponential:
Initial spike: Counts are highest at low total charges, indicating that a large share of customers are on lower-cost plans or have only recently subscribed to the service.
Exponential decay: Moving to the right on the plot, the number of customers falls off rapidly as total charges increase, leaving a long tail of customers with higher-cost plans, additional services, or long tenures. These customers contribute disproportionately to the company’s revenue.
churn_dist_plot <- ggplot(churn, aes(x=Churn)) +
geom_bar(fill="#e78ac3") +
labs(title="Churn Distribution", x="Churn", y="Count") +
theme_minimal()
print(churn_dist_plot)
tenure_histogram <- ggplot(subset(churn, Churn == "Yes"), aes(x = tenure)) +
geom_histogram(binwidth = 5, fill = "#e78ac3") + # You can adjust binwidth as needed
labs(title = "Tenure Distribution of Churned Customers", x = "Tenure", y = "Count") +
theme_minimal()
# Print and save the plot
print(tenure_histogram)
Insight: Around 26% of the customers have discontinued the service.
Exponential Decay: The plot shows a rapid initial decline in the number of churned customers as tenure increases. This indicates that a significant proportion of customers tend to churn relatively early in their relationship with the company.
Long Tail: As tenure increases, the number of churned customers declines at a slower rate. This long tail of the distribution suggests that some customers remain with the company for a more extended period before ultimately churning.
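The roughly 26% churn figure quoted above is simply the share of “Yes” values in the Churn column. A minimal sketch of the computation on toy data (`status` is a hypothetical stand-in for `churn$Churn`):

```r
# Toy demonstration of the churn-rate computation; with the real data,
# replace `status` with churn$Churn.
status <- c("No", "No", "Yes", "No")   # hypothetical mini example
churn_rate <- mean(status == "Yes")    # fraction of customers who churned
print(churn_rate)                      # [1] 0.25
```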
churn <- churn %>%
mutate(SeniorCitizen=ifelse(SeniorCitizen==1, "Yes", "No"))
demographic_data <- churn %>%
select(gender, SeniorCitizen, Partner, Dependents) %>%
gather(key = "Variable", value = "Value")
phone_services_data <- churn %>%
select(PhoneService, MultipleLines) %>%
gather(key = "Variable", value = "Value")
internet_services_data <- churn %>%
select(OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies) %>%
gather(key = "Variable", value = "Value")
payment_options_data <- churn %>%
select(PaperlessBilling, PaymentMethod) %>%
gather(key = "Variable", value = "Value")
# Create a plotting function for consistency
plot_data <- function(data){
ggplot(data, aes(x=Value, fill=Value)) +
geom_bar() +
facet_wrap(~ Variable, scales = "free") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom") +
scale_fill_brewer(palette="Set3") # Using a color palette from RColorBrewer package
}
# Create and print the plots
demographic_plot <- plot_data(demographic_data)
print(demographic_plot)
phone_services_plot <- plot_data(phone_services_data)
print(phone_services_plot)
internet_services_plot <- plot_data(internet_services_data)
print(internet_services_plot)
payment_options_plot <- plot_data(payment_options_data)
print(payment_options_plot)
These plots indicate that there is considerable class imbalance in the data. Hence we need to be cautious, as we might need to employ techniques like oversampling, undersampling, or regularization to circumvent this issue when training deep learning models.
churn_non_churn_tot_charge <- churn %>%
ggplot(aes(x = TotalCharges, fill = Churn)) +
geom_density(alpha = 0.8) +
scale_fill_manual(values = c('red', 'yellow')) +
labs(title = 'Total Charges Density Split of Churn vs Non-Churn')
print(churn_non_churn_tot_charge)
churn_by_internetService_plot <- churn %>%
group_by(InternetService) %>%
summarize(Churn_Rate = mean(Churn == "Yes")) %>%
ggplot(aes(x=InternetService, y=Churn_Rate)) +
geom_col(fill="#66c2a5") +
labs(title="Churn Rate by Internet Service", x="Internet Service", y="Churn Rate") +
theme_minimal()
print(churn_by_internetService_plot)
To showcase the popularity of various products, we’ve evaluated InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, and StreamingMovies. We’ve condensed these product columns into a single column for simplified visualization:
product_data <- churn %>%
select(InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies) %>%
gather(product, type) %>%
filter(!grepl("No internet service", type)) %>%
filter(type == "Yes")
# Count occurrences of each product type
product_count <- product_data %>%
group_by(product, type) %>%
count(type, sort=TRUE) %>%
arrange(desc(n)) %>%
ungroup() %>%
select(-type)
## Plot the data
product_plot <- ggplot(product_count, aes(x = reorder(product, -n), y = n, fill = product)) +
geom_bar(stat = "identity", width = 0.6) + # Adjust the width here
labs(title = "Product Usage Count",
x = "Product",
y = "Count") +
theme_minimal() +
theme(legend.position = "none") +
coord_flip() +
scale_fill_brewer(palette = "Set3")
print(product_plot)
This tells us that the most popular products purchased by customers are internet streaming services, followed by online backups, device protection and tech support.
After visualizing the data, we need to perform several manipulations to the data to ensure a precise analysis. In particular, SeniorCitizen should be encoded as a factor, and each categorical column is mapped to integer codes as follows:
Column: gender
Female Male
0 1
Column: SeniorCitizen
No Yes
0 1
Column: Partner
Yes No
0 1
Column: Dependents
No Yes
0 1
Column: PhoneService
No Yes
0 1
Column: MultipleLines
No phone service No Yes
0 1 2
Column: InternetService
DSL Fiber optic No
0 1 2
Column: OnlineSecurity
No Yes No internet service
0 1 2
Column: OnlineBackup
Yes No No internet service
0 1 2
Column: DeviceProtection
No Yes No internet service
0 1 2
Column: TechSupport
No Yes No internet service
0 1 2
Column: StreamingTV
No Yes No internet service
0 1 2
Column: StreamingMovies
No Yes No internet service
0 1 2
Column: Contract
Month-to-month One year Two year
0 1 2
Column: PaperlessBilling
Yes No
0 1
Column: PaymentMethod
Electronic check Mailed check Bank transfer (automatic)
0 1 2
Credit card (automatic)
3
Column: Churn
No Yes
0 1
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7590-VHVEG | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 29.85 | 29.85 | 0 |
| 5575-GNVDE | 1 | 0 | 1 | 0 | 34 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 56.95 | 1889.50 | 0 |
| 3668-QPYBK | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 53.85 | 108.15 | 1 |
| 7795-CFOCW | 1 | 0 | 1 | 0 | 45 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 2 | 42.30 | 1840.75 | 0 |
| 9237-HQITU | 0 | 0 | 1 | 0 | 2 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 70.70 | 151.65 | 1 |
| 9305-CDSKC | 0 | 0 | 1 | 0 | 8 | 1 | 2 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 99.65 | 820.50 | 1 |
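The mappings printed above can be produced by converting each character column to a factor and using its 0-based level codes. The helper below is a sketch of one way to do this (`encode_columns` is a hypothetical name; the original chunk is not shown):

```r
# Sketch (hypothetical helper): label-encode character columns as 0-based
# integer codes, printing each column's level-to-code mapping.
encode_columns <- function(df, cols) {
  for (col in cols) {
    f <- factor(df[[col]], levels = unique(df[[col]]))  # levels in order of appearance
    cat("Column:", col, "\n")
    print(setNames(seq_along(levels(f)) - 1, levels(f)))
    df[[col]] <- as.integer(f) - 1L                     # 0-based codes
  }
  df
}

# Toy usage on a hypothetical mini data frame:
toy <- data.frame(gender = c("Female", "Male", "Male"))
toy <- encode_columns(toy, "gender")
print(toy$gender)  # [1] 0 1 1
```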
Market Basket Analysis provides insights into products that tend to be purchased together. This can show us patterns in purchasing behaviors and can be useful for marketing strategies like product bundling and recommendations. For this purpose, we use the Apriori algorithm.
transaction_data <- as(
churn %>%
select(OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies), "transactions")
rules <- apriori(transaction_data, parameter = list(supp = 0.01, conf = 0.1))
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen
0.1 0.1 1 none FALSE TRUE 5 0.01 1
maxlen target ext
10 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 70
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[18 item(s), 7043 transaction(s)] done [0.00s].
sorting and recoding items ... [18 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 5 6 done [0.00s].
writing ... [2711 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
inspect(head(sort(rules, by = "lift")))
lhs rhs support confidence coverage
[1] {StreamingMovies=2} => {StreamingTV=2} 0.216669 1 0.216669
[2] {StreamingTV=2} => {StreamingMovies=2} 0.216669 1 0.216669
[3] {StreamingMovies=2} => {TechSupport=2} 0.216669 1 0.216669
[4] {TechSupport=2} => {StreamingMovies=2} 0.216669 1 0.216669
[5] {StreamingMovies=2} => {DeviceProtection=2} 0.216669 1 0.216669
[6] {DeviceProtection=2} => {StreamingMovies=2} 0.216669 1 0.216669
lift count
[1] 4.615334 1526
[2] 4.615334 1526
[3] 4.615334 1526
[4] 4.615334 1526
[5] 4.615334 1526
[6] 4.615334 1526
The rules are generated based on association rule mining. Each rule consists of an antecedent (lhs or left-hand side) and a consequent (rhs or right-hand side). The statistics provided (support, confidence, coverage, lift, and count) help in understanding the strength and relevance of these rules.
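As a quick sanity check on these statistics: lift is confidence divided by the support of the consequent. For the top rule above, confidence is 1 and the consequent {StreamingTV=2} appears in 1526 of the 7043 transactions, reproducing the reported lift:

```r
# Lift = confidence / support(consequent). For the top rule, confidence = 1
# and the consequent {StreamingTV=2} appears in 1526 of 7043 transactions.
confidence <- 1
supp_rhs   <- 1526 / 7043          # = 0.216669, matching the output above
lift       <- confidence / supp_rhs
round(lift, 6)                     # [1] 4.615334
```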
Rule 1: Customers who do not have OnlineBackup, DeviceProtection, TechSupport, StreamingTV, and StreamingMovies are very likely (with confidence of 1, or 100%) to have OnlineSecurity=Yes. This association occurs in about 3.63% of the transactions (support) and is 2.39 times more frequent than if the services were purchased independently (lift).
Rule 2: Customers who do not avail of OnlineSecurity, OnlineBackup, DeviceProtection, StreamingTV, and StreamingMovies are extremely likely (with confidence of 1, or 100%) to opt for TechSupport=Yes. This pattern is observed in about 2.11% of the transactions.
Rule 3: Customers lacking OnlineSecurity, OnlineBackup, TechSupport, StreamingTV, and StreamingMovies generally have DeviceProtection=Yes with full confidence. This association is found in approximately 2.57% of the transactions.
Rule 4: For those without OnlineSecurity, DeviceProtection, TechSupport, StreamingTV, and StreamingMovies, there’s a very strong indication (100% confidence) that they have availed of OnlineBackup=Yes. This occurs in roughly 4.33% of the transactions.
Rule 5: Customers who have OnlineSecurity=Yes but do not utilize OnlineBackup, DeviceProtection, TechSupport, and StreamingTV are likely (82.94% confidence) not to use StreamingMovies. This combination appears in around 3.63% of the total transactions.
Rule 6: Those with OnlineSecurity=Yes who do not have OnlineBackup, DeviceProtection, TechSupport, and StreamingMovies are probably (81.78% confidence) not using StreamingTV. This pattern is seen in about 3.63% of the transactions.
A common theme in most of these rules is that customers who refrain from using a range of services are very likely to use specific ones. This indicates the presence of specific user segments having precise requirements or preferences.
High-confidence rules, like the first four, are especially valuable for targeted marketing or product recommendations since the patterns are very consistent.
The lift values, which are greater than 1 for all the rules, suggest that the associations are not just random occurrences but have significant patterns in the dataset.
The Apriori algorithm for association rule mining is a popular choice for building recommender systems. However, since we only have access to customer demographics, i.e., gender, age range, partnership status, and dependents, without information about other products purchased, Apriori might not be the best choice here. Since we saw in the earlier section that there are specific user segments with precise requirements or preferences, using demographics alone as predictors for all other features is extremely unlikely to succeed, so we decide to skip this.
By leveraging cluster analysis, we can segment our customers into distinct groups based on their purchasing behaviors and preferences. These clusters can then be used to tailor marketing strategies, design custom offers, or even predict future behaviors.
For this analysis, we’ll start by preparing our data, which involves scaling and normalizing the relevant features, since clustering algorithms are generally sensitive to the scale of the data.
To decide on the appropriate number of clusters, we’ll employ the elbow method where we visualize the total within-cluster sum of squares against a range of potential cluster counts. An elbow in the curve often reveals the most suitable number of clusters, representing a point of diminishing returns where increasing the number of clusters doesn’t provide much better fit to the data.
churn_scaled <- churn %>%
select(PhoneService, MultipleLines, InternetService, OnlineSecurity,
OnlineBackup, DeviceProtection, TechSupport, StreamingTV,
StreamingMovies, Contract, PaperlessBilling, PaymentMethod,
MonthlyCharges, TotalCharges) %>%
mutate(across(where(is.numeric), scale))
set.seed(123)
# To decide on the number of clusters
wss <- sapply(1:15, function(k){
kmeans(churn_scaled, centers=k)$tot.withinss
})
plot(1:15, wss, type="b", pch=19, frame=FALSE,
xlab="Number of clusters K", ylab="Total within-clusters sum of squares")
Having identified that \(K=5\) is the optimal number of clusters, we’ll now proceed to actually cluster our scaled data using the K-medoids algorithm.
Setting a seed ensures that our results are reproducible. This is especially crucial in iterative algorithms like K-medoids where the starting points can influence the final cluster assignments.
set.seed(123) # To reproduce results
kmedoids_result <- pam(churn_scaled, k=5)
# Assign clusters to data
churn_scaled$clusters <- kmedoids_result$clustering
Utilizing K-medoids clustering, we calculate the distance of every data point to its cluster’s medoid. By setting a threshold at the 90th percentile of these distances, we identify the outliers in our data: points surpassing this threshold are deemed outliers for that particular cluster.
library(purrr)
# Function to compute distance from a data point to its medoid
compute_distance_to_medoid <- function(row, cluster) {
as.numeric(dist(rbind(row, kmedoids_result$medoids[cluster,])))
}
# Compute distances
churn_scaled$distance_to_medoid <- map2_dbl(
1:nrow(churn_scaled), churn_scaled$clusters,
~ compute_distance_to_medoid(churn_scaled[.x, -c((ncol(churn_scaled)-1):ncol(churn_scaled))], .y))
cluster_thresholds <- churn_scaled %>%
group_by(clusters) %>%
summarise(threshold = quantile(distance_to_medoid, 0.90))
churn_scaled <- left_join(churn_scaled, cluster_thresholds, by = "clusters")
churn_scaled$outlier <- ifelse(churn_scaled$distance_to_medoid > churn_scaled$threshold, 1, 0)
# Filter outliers
outliers <- churn_scaled %>%
filter(outlier == 1)
cat("Total Number of Outliers: ", nrow(outliers))
Total Number of Outliers: 705
Using UMAP (explained later), a dimensionality reduction technique, we visually project our clusters onto a 2D space. This visualization displays each data point according to its assigned cluster, and outliers are distinctly highlighted, providing an immediate overview of cluster distribution and anomalies.
library(umap)
library(ggplot2)
# Apply UMAP
umap_results <- umap(churn_scaled %>% model.matrix(~.-1, .))
# Combine UMAP results with cluster assignments
umap_df <- data.frame(UMAP1 = umap_results$layout[,1],
UMAP2 = umap_results$layout[,2],
                      cluster = as.factor(churn_scaled$clusters))
umap_df$outlier <- churn_scaled$outlier
# Plot using ggplot2
ggplot(umap_df, aes(x = UMAP1, y = UMAP2, color = cluster, shape = as.factor(outlier))) +
geom_point(alpha = 0.7) +
scale_shape_manual(values = c(16, 3), name = "Outlier", labels = c("No", "Yes")) +
scale_color_manual(values = c("red", "yellow", "green", "blue", "black")) +
labs(title = "UMAP Projection with K-medoids Clusters",
color = "Cluster") +
theme_minimal()
We can clearly see the outliers in the dataset; however, the clusters formed are not cleanly separable, though they still offer a useful visualization of the data.
# Load necessary libraries
library(GGally)
library(vcd)
# Assuming the dataframe is named churn
# Numeric correlation
numeric_columns <- churn[, c("tenure", "MonthlyCharges", "TotalCharges")]
numeric_correlation <- cor(numeric_columns)
print("Numeric correlations:")
[1] "Numeric correlations:"
print(numeric_correlation)
tenure MonthlyCharges TotalCharges
tenure 1.0000000 0.2478999 0.8261642
MonthlyCharges 0.2478999 1.0000000 0.6511820
TotalCharges 0.8261642 0.6511820 1.0000000
# Function to compute point-biserial correlation
point_biserial <- function(factor_var, numeric_var) {
cor.test(as.numeric(factor_var), numeric_var, method = "pearson")$estimate
}
factor_cols <- c("gender", "SeniorCitizen", "Partner", "Dependents", "PhoneService", "MultipleLines", "InternetService", "OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies", "Contract", "PaperlessBilling", "PaymentMethod", "Churn")
numeric_cols <- c("tenure", "MonthlyCharges", "TotalCharges")
# Categorical vs Numeric correlation
for(f in factor_cols) {
for(n in numeric_cols) {
correlation <- point_biserial(churn[[f]], churn[[n]])
if(abs(correlation)>0.5){
cat(sprintf("Correlation between %s and %s: %f\n", f, n, correlation))
}
}
}
Correlation between OnlineSecurity and MonthlyCharges: -0.621227
Correlation between OnlineBackup and MonthlyCharges: -0.710477
Correlation between OnlineBackup and TotalCharges: -0.537234
Correlation between DeviceProtection and MonthlyCharges: -0.513440
Correlation between TechSupport and MonthlyCharges: -0.597594
Correlation between Contract and tenure: 0.671607
# Function to compute Cramér's V
cramers_v <- function(var1, var2) {
tbl <- table(var1, var2)
assocstats(tbl)$cramer
}
# Categorical vs Categorical correlation
for(i in 1:(length(factor_cols) - 1)) {
for(j in (i + 1):length(factor_cols)) {
correlation <- cramers_v(churn[[factor_cols[i]]], churn[[factor_cols[j]]])
if(abs(correlation)>0.7){
cat(sprintf("Cramér's V between %s and %s: %f\n", factor_cols[i], factor_cols[j], correlation))
}
}
}
Cramér's V between PhoneService and MultipleLines: 1.000000
Cramér's V between InternetService and OnlineSecurity: 0.724466
Cramér's V between InternetService and OnlineBackup: 0.707184
Cramér's V between InternetService and DeviceProtection: 0.707108
Cramér's V between InternetService and TechSupport: 0.722904
Cramér's V between InternetService and StreamingTV: 0.717099
Cramér's V between InternetService and StreamingMovies: 0.716008
Cramér's V between OnlineSecurity and OnlineBackup: 0.718434
Cramér's V between OnlineSecurity and DeviceProtection: 0.717291
Cramér's V between OnlineSecurity and TechSupport: 0.733071
Cramér's V between OnlineSecurity and StreamingTV: 0.707788
Cramér's V between OnlineSecurity and StreamingMovies: 0.708208
Cramér's V between OnlineBackup and DeviceProtection: 0.719125
Cramér's V between OnlineBackup and TechSupport: 0.719834
Cramér's V between OnlineBackup and StreamingTV: 0.714699
Cramér's V between OnlineBackup and StreamingMovies: 0.713682
Cramér's V between DeviceProtection and TechSupport: 0.726485
Cramér's V between DeviceProtection and StreamingTV: 0.733661
Cramér's V between DeviceProtection and StreamingMovies: 0.736047
Cramér's V between TechSupport and StreamingTV: 0.716266
Cramér's V between TechSupport and StreamingMovies: 0.716327
Cramér's V between StreamingTV and StreamingMovies: 0.771042
The correlation coefficients between the numeric variables are as follows:
For correlation between numeric and categorical (factor) variables, we use point-biserial correlation. The strong correlations (greater than 0.5 in absolute value) are:
Cramér’s V is a statistic for measuring the strength of association among categorical variables. The following pairs have strong correlations (greater than 0.7):
Tenure is a significant driver of total charges, as seen from its strong correlation with the TotalCharges column.
There are specific internet-related services (like OnlineBackup, OnlineSecurity, DeviceProtection, and TechSupport) that are negatively correlated with monthly charges, indicating bundled or discounted plans.
The high degree of inter-correlation among internet-related services suggests that customers often opt for multiple services together.
Customers who stream TV are very likely to stream movies, indicating an overlap in the customer base for these two services.
This analysis provides insights into customer behavior and preferences, which can be invaluable for targeted marketing and tailoring service offerings.
Dimensionality reduction is a technique used to reduce the number of features in a dataset while retaining as much information as possible.
PCA is a method for dimensionality reduction that identifies axes, or principal components, which maximize the variance of the dataset.
For centered data, performing PCA is equivalent to taking the SVD of the data matrix \(A\): PCA’s principal components align with SVD’s right singular vectors, and PCA’s eigenvalues equal the squared singular values of \(A\) divided by \(n-1\).
SVD decomposes a matrix \(A\) into \(U\), \(\Sigma\), and \(V^T\):
\[ A = U \Sigma V^T \]
Where:
- \(U\): “left singular vectors”.
- \(\Sigma\): diagonal matrix with decreasing “singular values”.
- \(V^T\): “right singular vectors”.
In data analysis:
- Columns of \(U\) show patterns in \(A\)’s rows.
- Columns of \(V\) indicate patterns in \(A\)’s columns.
- \(\Sigma\)’s singular values denote pattern significance.
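The PCA–SVD correspondence can be checked numerically with base R (a small sketch on a random toy matrix): for a column-centered matrix with \(n\) rows, `prcomp`’s standard deviations equal the singular values divided by \(\sqrt{n-1}\).

```r
# Verify the PCA-SVD correspondence on a toy matrix using base R only.
set.seed(1)
A <- matrix(rnorm(20), nrow = 5)                     # toy 5 x 4 data matrix
A_centered <- scale(A, center = TRUE, scale = FALSE) # column-center A

sv  <- svd(A_centered)$d                             # singular values of centered A
pca <- prcomp(A, center = TRUE, scale. = FALSE)

# prcomp's standard deviations equal sigma_i / sqrt(n - 1)
all.equal(pca$sdev, sv / sqrt(nrow(A) - 1))          # [1] TRUE
```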
churn_processed <- churn %>%
select(-customerID) %>% # Remove ID since it's unique and not useful for clustering
mutate_at(vars(tenure, MonthlyCharges, TotalCharges), scale) %>% # Standardizing
model.matrix(~.-1, .) # One-hot encoding
library(stats)
library(plotly)
library(ggfortify)
# Applying PCA
pca_result <- prcomp(churn_processed, center = TRUE, scale. = TRUE)
# Extract variance explained by each principal component
explained_var <- pca_result$sdev^2
explained_var_ratio <- explained_var / sum(explained_var)
# Now, plot the biplot:
autoplot(pca_result,
data = churn_processed,
loadings = TRUE,
loadings.label = TRUE,
loadings.label.size = 2,
label = FALSE,
scale = 0,
alpha = 0.3) +
theme_minimal() +
labs(title = "PCA Biplot",
x = paste("PC1:", round(explained_var_ratio[1]*100, 2), "% variance"),
y = paste("PC2:", round(explained_var_ratio[2]*100, 2), "% variance")) +
theme(legend.position = "right")
The biplot shows arrows representing the original variables in the data. The length of an arrow indicates the strength of a variable with respect to the principal components. The angle between arrows reflects the correlation between variables; a small angle means the variables are positively correlated, while a 180-degree angle means they are negatively correlated.
In the above plot, it is interesting to observe that Contract, Dependents, and tenure are highly correlated, which might hint that customers with dependents are likely to opt for a contract and stay longer. It also tells us that the first and second principal components explain \(31\%\) and \(12\%\) of the variance in the data, respectively.
# Data frame for explained variance
explained_df <- data.frame(component = 1:length(explained_var_ratio),
variance = explained_var_ratio)
# ggplot2 Scree plot
ggplot(explained_df, aes(x = component, y = variance)) +
geom_line(color = "red") +
geom_point(color = "red", size = 3) +
labs(title = "Scree Plot",
x = "Principal Component",
y = "Proportion of Variance Explained") +
theme_minimal()
The Scree Plot displays the variance explained by each principal component and clearly shows that most of the variance is explained by the first \(k=3\) principal components. Hence for visualization, we only consider these principal components.
# Extracting the first three principal components for 3D visualization
df_pca <- as.data.frame(pca_result$x[,1:3]) %>%
mutate(churn = churn$Churn)
# 3D visualization using plotly
plot_ly(df_pca, x = ~PC1, y = ~PC2, z = ~PC3, color = ~churn,
type = "scatter3d", mode = "markers", marker = list(size = 3)) %>%
layout(title = "3D PCA visualization of churn data")
The 3D plot is insightful, highlighting two distinct linearly separable subgroups in the data. Within these subgroups, specific boundary regions indicate areas where customer churn can be anticipated.
t-SNE (t-Distributed Stochastic Neighbor Embedding) is a dimensionality reduction method used mainly for visualizing high-dimensional data. It retains local structure, meaning nearby instances in the original space remain close in the reduced space. Unlike PCA, t-SNE better handles the crowding of points, but it can get trapped in local minima because its cost function is non-convex.
Here’s a brief overview of t-SNE:
- In the high-dimensional space, t-SNE calculates pairwise similarities using a Gaussian distribution.
- In the low-dimensional space, it models similarity using a t-distribution.
- It then minimizes the divergence between the two distributions using gradient descent.
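The divergence minimized in that final step is the Kullback–Leibler divergence between the high-dimensional similarities \(P\) and the low-dimensional similarities \(Q\):

\[ C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} \]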
library(Rtsne)
tsne_result <- Rtsne(churn_processed, dims = 2, perplexity = 30, check_duplicates = FALSE)
# Visualization
df_tsne <- as.data.frame(tsne_result$Y) %>%
mutate(churn = churn$Churn)
ggplot(df_tsne, aes(x = V1, y = V2, color = churn)) +
geom_point(alpha = 0.5) +
labs(title = "t-SNE visualization of churn data",
x = "Dimension 1",
y = "Dimension 2") +
theme_minimal()
The plot reveals several distinct subgroups. Similar to previous observations, the embeddings indicate specific boundary regions where customer churn predictions are feasible.
# Apply t-SNE for 3 dimensions
tsne_result_3D <- Rtsne(churn_processed, dims = 3, perplexity = 30, check_duplicates = FALSE)
# Convert to a dataframe for visualization
df_tsne_3D <- as.data.frame(tsne_result_3D$Y) %>%
mutate(churn = churn$Churn)
# Visualize using plotly
plot_ly(df_tsne_3D, x = ~V1, y = ~V2, z = ~V3,
color = ~churn,
type = "scatter3d",
mode = "markers",
marker = list(size = 2, opacity = 0.6)
) %>%
layout(title = "3D t-SNE visualization of churn data")
This 3D t-SNE visualization again shows several subgroups, and once again the embeddings contain specific boundary regions associated with customer churn.
UMAP (Uniform Manifold Approximation and Projection) is another dimensionality reduction technique, akin to t-SNE but often faster and more scalable. The algorithm approximates the geometric structure of the data in high-dimensional space and maps it into a lower-dimensional space. Compared to t-SNE, UMAP preserves local structure while retaining more of the global structure of the data.
Here’s a brief about UMAP:
Construct a Fuzzy Topological Representation: UMAP begins by constructing a high-dimensional graph representation of the data. For each point, it identifies its neighboring points (typically using the nearest neighbors algorithm). Instead of creating a strict “k-nearest neighbors” graph, UMAP constructs a fuzzy simplicial set: points are interconnected with different strengths, representing how “close” they are in terms of the local metric.
Optimize a Low-Dimensional Representation: UMAP then optimizes a low-dimensional representation of this high-dimensional graph. This is done by minimizing the cross-entropy between the high-dimensional and low-dimensional representations. This step is similar to t-SNE in the sense of trying to maintain the local and global structure, but it uses a different approach to achieve this.
library(umap)
umap_result <- umap(churn_processed, n_components = 2)
# Data frame for plotting
df_umap <- as.data.frame(umap_result$layout) %>%
mutate(churn = churn$Churn)
# Visualization
ggplot(df_umap, aes(x = V1, y = V2, color = churn)) +
geom_point(alpha = 0.5) +
labs(title = "UMAP visualization of churn data",
x = "Dimension 1",
y = "Dimension 2") +
theme_minimal()
umap_result_3d <- umap(churn_processed, n_components = 3)
df_umap_3d <- as.data.frame(umap_result_3d$layout) %>%
mutate(churn = churn$Churn)
# Create the 3D plot
plot_ly(df_umap_3d, x = ~V1, y = ~V2, z = ~V3, color = ~churn,
type = "scatter3d", mode = "markers", marker = list(size = 3)) %>%
layout(title = "3D UMAP visualization of churn data")
This is a very informative visualization: it shows two linearly separable groups, which contain \(3\) and \(5\) clusters respectively. This also hints that the PCA visualization above might collapse each of these clusters into the two linearly separable groups.
For a reliable assessment of model performance, we divide our dataset into training and testing splits with a \(70\)-\(30\) ratio, which ensures sufficient training data while allowing for a robust evaluation.
# Perform the train-test split (e.g., 70% training and 30% testing)
set.seed(123) # For reproducibility
sample_index <- sample(nrow(churn), 0.7 * nrow(churn))
train_data <- churn[sample_index, ]
test_data <- churn[-sample_index, ]
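A purely random split can shift the class balance slightly between the two subsets. One way to guard against this is a stratified split; the sketch below is an optional alternative that assumes the `churn` data frame from above and uses caret's `createDataPartition`, which samples within each outcome class.

```r
# Stratified 70/30 split that preserves the churn rate in both subsets
library(caret)
set.seed(123)
strat_index <- createDataPartition(churn$Churn, p = 0.7, list = FALSE)
train_strat <- churn[strat_index, ]
test_strat  <- churn[-strat_index, ]
# the churn rates should now match almost exactly
prop.table(table(train_strat$Churn))
prop.table(table(test_strat$Churn))
```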
Selecting an ML model isn’t about chasing the one with the highest accuracy blindly. It’s vital to align the model’s strengths with the problem at hand, ensuring it can capture the underlying patterns and relationships in the data. For our churn prediction task, the Random Forest model emerges as a standout choice. Random Forest is an ensemble learning method that combines the predictive power of multiple decision trees, making it robust, accurate, and suitable for both numerical and categorical data, which aligns well with the features in our churn dataset.
library(randomForest)
# Define the predictor variables (features) and the target variable (Churn)
predictor_columns <- c("gender", "SeniorCitizen", "Partner", "Dependents",
"tenure", "PhoneService", "MultipleLines", "InternetService",
"OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport",
"StreamingTV", "StreamingMovies", "Contract", "PaperlessBilling",
"PaymentMethod", "MonthlyCharges", "TotalCharges")
target_column <- "Churn"
# Train the Random Forest model
rf_model <- randomForest(factor(Churn) ~ ., data = train_data[, c(predictor_columns, target_column)], ntree = 200)
saveRDS(rf_model, "../models/ML_model/randomForest_model.rda")
Mean Decrease in Gini is a metric that measures how much each variable contributes to the homogeneity of nodes and leaves across the ensemble of decision trees. A higher Mean Decrease in Gini signifies a more substantial impact on reducing impurity and enhancing the model’s ability to make accurate predictions. Notably, we observe that TotalCharges, tenure, MonthlyCharges, and Contract exhibit the highest Mean Decrease in Gini scores.
varImpPlot(rf_model, main = "Random Forest Feature Importance")
The plot highlights that the features TotalCharges, tenure, and MonthlyCharges emerge as the most influential. Notably, there’s a pronounced drop in importance when moving to the feature Contract.
The Out-of-Bag (OOB) error is a method of measuring the prediction error of ensemble models, particularly Random Forests. Each tree in the forest is trained on a bootstrap sample of the training data; the observations left out of that sample (roughly one third) act as a built-in validation set for that tree. Aggregating these held-out predictions yields the OOB error, which provides a handy estimate of the generalization error and helps gauge the model’s performance without additional cross-validation.
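The OOB error can also be tracked as trees are added, which shows whether `ntree = 200` was enough for the error to stabilize. A quick sketch, assuming the `rf_model` object trained above:

```r
# err.rate has one row per tree; the "OOB" column is the cumulative
# out-of-bag error after that many trees have been grown
plot(rf_model$err.rate[, "OOB"], type = "l",
     xlab = "Number of trees", ylab = "OOB error",
     main = "OOB error vs. number of trees")
```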
oob_error <- rf_model$err.rate[nrow(rf_model$err.rate), "OOB"]
cat("Out of Bag Error:", oob_error, "\n")
Out of Bag Error: 0.2079108
Now that we have trained our random forest model (rf_model), we assess its performance on a separate held-out test set.
library(ROCR)
# Make predictions on the test set
predictions <- predict(rf_model, newdata = test_data[predictor_columns], type = "response")
# Calculate accuracy
accuracy <- mean(predictions == test_data$Churn)
cat("Accuracy:", accuracy, "\n")
Accuracy: 0.8069096
# Calculate AUROC
# Create a prediction object for ROC analysis
prediction_obj <- prediction(as.numeric(predictions), test_data$Churn)
# Calculate AUROC
performance <- performance(prediction_obj, measure = "auc")
auroc <- as.numeric(performance@y.values)
cat("AUROC:", auroc, "\n")
AUROC: 0.7050047
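Note that the AUROC above was computed from hard class labels, which collapses the ROC curve to a single operating point. Feeding ROCR the predicted class probabilities instead traces the full curve and typically yields a more informative score. A sketch, assuming `rf_model`, `test_data`, and `predictor_columns` from above; the second probability column is taken as the positive (churn) class, which matches the second factor level.

```r
# Class probabilities instead of hard labels for the ROC analysis
probs <- predict(rf_model, newdata = test_data[predictor_columns],
                 type = "prob")[, 2]   # P(churn), second factor level
pred_prob <- prediction(probs, test_data$Churn)
auroc_prob <- as.numeric(performance(pred_prob, measure = "auc")@y.values)
cat("AUROC (from probabilities):", auroc_prob, "\n")
```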
conf_matrix <- table(Actual = test_data$Churn, Predicted = predictions)
precision <- conf_matrix[2, 2] / sum(conf_matrix[, 2])
recall <- conf_matrix[2, 2] / sum(conf_matrix[2, ])
f1_score <- 2 * (precision * recall) / (precision + recall)
cat("Precision:", precision, "\n")
Precision: 0.6304348
cat("Recall:", recall, "\n")
Recall: 0.505814
cat("F1 Score:", f1_score, "\n")
F1 Score: 0.5612903
# Display confusion matrix
print(conf_matrix)
Predicted
Actual 0 1
0 1444 153
1 255 261
Based on the random forest model, the results on the test set are as follows:
Accuracy: The model’s accuracy is approximately \(80.69\%\). This means that, out of all the predictions made by the model, around \(80.69\%\) were correct. It’s a relatively high accuracy, suggesting that the model is doing well in generalizing to unseen data.
AUROC: The AUROC value is about \(0.705\). The AUROC evaluates the model’s ability to distinguish between the positive (churned) and negative (non-churned) classes. An AUROC score of \(0.705\) is considered moderate. While it’s above the no-discrimination line of \(0.5\), there is room for improvement. It would be ideal for this score to be closer to \(1\), indicating perfect discrimination.
Precision: The precision is approximately \(0.6304\), meaning that when the model predicts a customer will churn, it’s correct about \(63.04\%\) of the time.
Recall (or Sensitivity): The recall is around \(0.5058\), suggesting that the model successfully identifies \(50.58\%\) of all actual churn cases. This indicates that the model is missing out on predicting almost half of the customers who do churn.
F1 Score: The F1 score, which is the harmonic mean of precision and recall, stands at approximately \(0.5613\). An F1 score closer to \(1\) is ideal, and this current value indicates that there is a balance between precision and recall, but there’s potential for improvement.
Confusion Matrix:
From the confusion matrix:
True Negatives (TN): 1444 - Customers who were predicted not to churn and did not churn.
False Positives (FP): 153 - Customers who were predicted to churn but did not.
False Negatives (FN): 255 - Customers who were predicted not to churn but did churn.
True Positives (TP): 261 - Customers who were predicted to churn and actually did.
Hence, while the model exhibits decent accuracy, metrics like recall and AUROC suggest that there is potential for further optimization!
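Given the low recall, one low-cost lever is the decision threshold: lowering it below the default \(0.5\) trades precision for recall, recovering more of the churners the model currently misses. A sketch, assuming the objects from above; the threshold \(0.4\) is illustrative, not tuned.

```r
# Re-derive hard predictions from churn probabilities at a lower threshold
probs <- predict(rf_model, newdata = test_data[predictor_columns],
                 type = "prob")[, 2]             # P(churn)
lvls <- levels(factor(test_data$Churn))          # c(negative, positive)
preds_lowthr <- ifelse(probs > 0.4, lvls[2], lvls[1])
# recall should rise relative to the default-threshold confusion matrix
table(Actual = test_data$Churn, Predicted = preds_lowthr)
```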